Migrating from 2.x - ModeShape 4

This page is primarily for developers who have already used ModeShape 1 or 2, and describes the similarities and differences in ModeShape 3. As you'll see, your application code will likely not have to change much (if at all), since ModeShape 3 still supports the standard JCR API.

However, the biggest change is how ModeShape repositories store their content and how they are configured and managed. We came to understand that configuring ModeShape 2 repositories was overly complex, and we wanted to fix that in ModeShape 3. Also, a goal of ModeShape 3 was to achieve at least a 10-fold increase in scalability (of clustering and of repository size), and this was simply not possible with the old storage system.

None of these changes were undertaken lightly, and all were made with the goal of making ModeShape 3 easier to configure, easier to use, faster, more resilient, and more scalable. We think that once you learn how ModeShape 3 has changed, you'll really like it.

Public API

Client applications use the standard JCR 2.0 API to interact with ModeShape 2 and ModeShape 3 repositories. So most (if not all) of your application code will not need to be changed. ModeShape 2 did provide a small public API that extended the standard JCR API with a few additional capabilities, and ModeShape 3 supports and slightly expands this public API.

Several of the methods and interfaces in ModeShape's public API were deprecated by version 2.8, and these have been removed from the ModeShape 3 API.

ModeShape 3 also passes 100% of the unofficial JSR-283 (JCR 2.0) compatibility tests, as maintained by the reference implementation. (The official TCK has quite a few bugs that have been fixed by the reference implementation community. So although these compatibility tests are not official, we believe these tests are a more accurate representation of the compliance with the intent of the specification. Plus, other implementations use these same tests.)

ModeShape also provided several other APIs:

The RESTful API that was in 2.x is still supported, although the URLs have changed. ModeShape 3 adds a new RESTful API, and this is now the default. This API is cleaner and more capable. The RESTful client library is capable of talking to both ModeShape 2.x and 3.x servers.
The WebDAV API is still supported and has been improved.
The JDBC Driver is still supported and is largely unchanged.

Storage vs. connectors

ModeShape 2.x provided several storage connectors:

the disk-based connector, which stored any content on the local file system
the JPA connector, which stored any content in a JDBC database using Hibernate
the Infinispan connector, which stored any content within an Infinispan 4 or 5 cache
the JBoss Cache connector, which stored any content within a JBoss Cache instance (this was legacy and not recommended for general use)

plus several access connectors:

the file system connector, which could project into the repository nodes that represented the files and folders within an area of the file system
the SVN connector, which could project into the repository nodes that represented the files and folders within an SVN repository
the JCR connector, which could project into the repository nodes that existed in another JCR repository instance

ModeShape 1 and 2 used a single SPI for both kinds of connectors. We learned fairly early that this was not ideal, since the two different access patterns really required dedicated operations. Also, ModeShape didn't originally provide a centralized caching system, and retrofitting one proved to be quite complicated, so each connector tended to implement its own caching mechanism. Finally, optimizing content operations using the SPIs was proving difficult due to the design of the SPI itself.

One of our goals for ModeShape 3 was to dramatically increase the scalability of a repository, both in terms of scaling out (by clustering multiple ModeShape processes) and in terms of the amount of content. So in the summer of 2011, we embarked on a project to build a new repository engine that could achieve these goals while correcting the problems in the 2.x connector system. We ultimately decided that if a repository put the node representations in an Infinispan cache, Infinispan would act as a very efficient cache and could persist the content using a variety of techniques. On the small end, Infinispan is easily embeddable and can store content on the file system, in databases, or even in the cloud. But for larger scale configurations, a single Infinispan cache can manage our content in-memory by creating an effective "super-heap" across multiple processes and machines (and even spread across multiple data centers), ensuring that several copies of each node are maintained and distributed across the cluster. Infinispan is a data grid that can scale to very large sizes, and ModeShape repositories can benefit from these capabilities.

So in ModeShape 3, a repository always uses a single Infinispan cache as both its caching and storage system. All workspace content (except for binary value storage; see below) and the system content are all stored within this single cache.

Federation

ModeShape 3.0 does not yet support using the JCR API to access information in external systems. That is the most important feature for 3.1, which will reintroduce the concept of a connector as the mechanism to do this. One major difference, however, is that ModeShape 3.1 will not be able to create a repository that consists entirely of federated content. Instead, every ModeShape 3 repository will store its own content, and you'll also be able to federate and integrate content from external systems into the repository.

Conceptually this is a bit different than in ModeShape 2, which seemed to allow a repository to be configured such that all content was federated from external systems. Technically, even ModeShape 2 required a storage connector to store the repository's system content, so it was never actually possible to have a repository that consisted entirely of federated content.

Binary storage

Storing large BINARY values was something that ModeShape 2 didn't do very well. That's because very large BINARY values are used differently and really need to be stored differently. For example, BINARY values are always streamed and never need to be pulled completely into memory. They are also immutable, which means they can be treated differently than the rest of the content.

ModeShape 3 was designed to explicitly handle very, very large BINARY values. To do that, ModeShape 3 separates out the storage of binary values into a completely separate store, where the values are stored based upon their SHA-1 hash. This means that the same BINARY value is never stored more than once, even if that BINARY value is used in properties on multiple nodes. ModeShape provides several binary storage options:

Store binary values on the local file system, which can be a regular directory, a network share or even a temporary directory. This option is generally fast and as safe as your file system, and native locks are used to prevent multiple processes from conflicting. It is an excellent choice for local (non-clustered), embedded repositories.
Store binary values in Infinispan. Although it is possible to use the same cache as the rest of the repository uses, it is often far better to use two other caches and to configure those caches specifically for what they store. For example, one cache is used to store metadata about the BINARY values; this metadata is small, lightweight, and can thus be replicated across the cluster. The other cache is used to store the actual binary data, separated into chunks (usually up to 1MB in size), and for this cache distributing the data across the cluster is often desirable.
Store binary values in a relational database. This storage is recommended only when you are required to persist all content inside a relational database. The binary data is broken into chunks (usually up to 1MB in size).
Store binary values in MongoDB. This storage option has not been thoroughly tested, but can be considered as an option.
Store binary values in a custom store. ModeShape 3 provides an SPI for implementing your own binary storage.
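The idea of keying binary values by their SHA-1 hash, and of splitting them into chunks, can be sketched with nothing more than the JDK. The following is only an illustration under stated assumptions: the class ContentAddressedStoreSketch and all of its methods are hypothetical names invented for this example and are not part of ModeShape's binary storage SPI.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is NOT ModeShape's binary store API.
final class ContentAddressedStoreSketch {

    // Maps the hex SHA-1 hash of a value to the chunks that make up the value.
    private final Map<String, List<byte[]>> chunksByKey = new HashMap<>();
    private final int chunkSize;

    ContentAddressedStoreSketch(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    // Computes the SHA-1 hash of the content as a lowercase hex string.
    static String sha1Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Splits the content into consecutive chunks of at most chunkSize bytes.
    static List<byte[]> split(byte[] content, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < content.length; start += chunkSize) {
            int end = Math.min(content.length, start + chunkSize);
            chunks.add(Arrays.copyOfRange(content, start, end));
        }
        return chunks;
    }

    // Stores the content only if no value with the same hash exists; returns the key.
    String store(byte[] content) {
        String key = sha1Hex(content);
        chunksByKey.computeIfAbsent(key, k -> split(content, chunkSize));
        return key;
    }

    int storedValueCount() {
        return chunksByKey.size();
    }

    int chunkCount(String key) {
        List<byte[]> chunks = chunksByKey.get(key);
        return chunks == null ? 0 : chunks.size();
    }

    public static void main(String[] args) {
        ContentAddressedStoreSketch store = new ContentAddressedStoreSketch(4);
        byte[] value = "abcdefghij".getBytes(StandardCharsets.UTF_8);
        String first = store.store(value);
        String second = store.store(value); // same content, so nothing new is stored
        System.out.println(first.equals(second));
        System.out.println(store.storedValueCount());
        System.out.println(store.chunkCount(first));
    }
}
```

Because the key is derived from the content alone, two properties that hold identical bytes resolve to the same key, which is why the value needs to be stored only once. The tiny chunk size in main is chosen purely to make the chunking visible.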

Sequencers

ModeShape 3 sequencers work exactly the same way as they did in ModeShape 1.x and 2.x: they automatically take new or updated content (matched by a path-based rule), generate additional structured content, and write that new content into the repository (in a location determined by the configuration).

They are configured differently, most notably because each repository is configured with its own sequencers. Implementing custom sequencers, however, is far easier in 3.0, since sequencers generate the additional content by directly using the JCR API rather than the proprietary graph API in ModeShape 2. Sequencer implementations are also able to register the node types programmatically, which simplifies the overall configuration for a repository.
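The path-based rule that drives a sequencer (match an input path, then derive an output location) can be illustrated with a plain regular expression. This is a simplified stand-in and an assumption made for this example: it does not implement ModeShape's own path expression syntax, and PathRuleSketch is a hypothetical class name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; ModeShape uses its own path expression syntax.
final class PathRuleSketch {

    private final Pattern inputPattern;
    private final String outputTemplate;

    // inputRegex must contain one capture group; outputTemplate may refer to it as $1.
    PathRuleSketch(String inputRegex, String outputTemplate) {
        this.inputPattern = Pattern.compile(inputRegex);
        this.outputTemplate = outputTemplate;
    }

    // Returns the output path for a changed node, or empty if the rule does not apply.
    Optional<String> outputPathFor(String changedPath) {
        Matcher matcher = inputPattern.matcher(changedPath);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        return Optional.of(outputTemplate.replace("$1", matcher.group(1)));
    }

    public static void main(String[] args) {
        // Match any *.csv file at any depth under /files and capture its name.
        PathRuleSketch rule = new PathRuleSketch(
                "/files/(?:[^/]+/)*([^/]+\\.csv)/jcr:content",
                "/sequenced/text/delimited/$1");
        System.out.println(rule.outputPathFor("/files/reports/2012/sales.csv/jcr:content"));
        System.out.println(rule.outputPathFor("/files/readme.txt/jcr:content"));
    }
}
```

A real sequencer configuration expresses the same two halves (what to watch, where to write) in a single path expression string.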

MIME type detection

ModeShape 1 and 2 had the ability to automatically detect the MIME type for BINARY values. Several detectors were provided by ModeShape, and these very often didn't need to be customized or altered.

ModeShape 3 also has the same ability, but we've made several improvements. First of all, we've added two new methods to the org.modeshape.jcr.api.Binary interface (which extends the standard javax.jcr.Binary interface) to obtain the MIME type. Your applications can use these methods to discover the MIME type for a BINARY value and, for example, to set the "jcr:mimeType" property on the node.

Secondly, we've removed our SPI for custom MIME type detectors. Instead, ModeShape 3 simply uses the Apache Tika framework, which has several MIME type detectors and provides its own SPI for custom detectors.

Text extractors

Text extractors also work exactly the same way as they did in ModeShape 1.x and 2.x. Their purpose is to extract searchable text from BINARY values, so that full-text search and queries are able to find results that match the content of BINARY values. ModeShape 3 continues to support several built-in extractors, including one that uses Apache Tika. However, in ModeShape 3.x we've added a simple SPI so that you can easily create your own extractors.

Configuration and running the engine

ModeShape 2 configuration files contained specifications for multiple repositories. Some components, like sequencers and repository sources, were configured separately from the repositories. To run an engine, you first read in the configuration file and then created an engine from the configuration:

// Load the one configuration file ...
JcrConfiguration configuration = new JcrConfiguration();
configuration.loadFrom("modeshape-config.xml");

// Create and start an engine ...
JcrEngine engine = configuration.build();
engine.start();

// Get the repositories by their names ...
Repository repository1 = engine.getRepository("Cars");
Repository repository2 = engine.getRepository("Catalog");

Using a single configuration file for the engine seemed to make sense, but it was also confusing because a single sequencer might be used in multiple repositories. It was also potentially problematic, because a single source might be used by multiple repositories, even though this was not allowed. ModeShape 2 didn't allow modifying the configuration while the engine was running, which meant it was not possible to dynamically add or remove repositories without completely shutting down and restarting the engine. (In reality, very little was shared between repositories.)

ModeShape 3 separates the configuration of each repository into its own file, and each configuration is "deployed" to an engine:

// Start the engine ...
ModeShapeEngine engine = new ModeShapeEngine();
engine.start();

// Deploy and use repository 1 ...
RepositoryConfiguration config1 = RepositoryConfiguration.read("cars-config.json");
engine.deploy(config1);
Repository repository1 = engine.getRepository("Cars");

// Deploy and use repository 2 ...
RepositoryConfiguration config2 = RepositoryConfiguration.read("catalog-config.json");
engine.deploy(config2);
Repository repository2 = engine.getRepository("Catalog");

// Undeploy repository 1 ...
engine.undeploy("Cars");

As you can see, it's now possible to dynamically deploy and undeploy repositories even when the engine is running and other repositories are in use. There are multiple ways of reading in the configuration, too:

read from a java.io.File
read from a resolved java.net.URL
read from a String containing a URL or a path to a file on the file system or classpath
read a string containing the configuration

You might have also noticed in the example above that ModeShape 3 configuration files are JSON files, not XML files like in ModeShape 1 and 2. We thought that XML configuration files are noisy and make it difficult to see the bigger picture. JSON files, on the other hand, are quite easy to read and edit. And ModeShape does use a JSON Schema that dictates the allowed structure of the configuration files, so ModeShape can even validate your configuration files:

Problems problems = config1.validate();
if ( problems.hasProblems() ) {
    // Output a summary of the problems (with line numbers) ...
    System.out.println(problems);
}

ModeShape 3 configuration files also have sensible defaults for everything, so this file is actually a valid configuration for a repository named "my-repo":

my-repo.json

{ "name" : "my-repo" }

Of course, you'll likely want to specify more options, so here is another example of a repository with most of the available options specified:

my-repo-config.json

{
    "name" : "my-repo",
    "transactionMode" : "auto",
    "monitoring" : {
        "enabled" : true
    },
    "workspaces" : {
        "predefined" : ["otherWorkspace"],
        "default" : "default",
        "allowCreation" : true,
        "initialContent" : {
            "ws1" : "file1.xml",
            "ws2" : "file2.xml",
            "*" : "default.xml"
        }
    },
    "node-types" : ["file1.cnd", "file2.cnd"],
    "storage" : {
        "cacheName" : "Thorough",
        "cacheConfiguration" : "infinispan_configuration.xml",
        "transactionManagerLookup" : "org.infinispan.transaction.lookup.GenericTransactionManagerLookup",
        "binaryStorage" : {
            "type" : "file",
            "directory" : "Thorough/binaries",
            "minimumBinarySizeInBytes" : 4096
        }
    },
    "security" : {
        "anonymous" : {
            "username" : "<anonymous>",
            "roles" : ["readonly","readwrite","admin"],
            "useOnFailedLogin" : false
        },
        "providers" : [
            {
                "name" : "My Custom Security Provider",
                "classname" : "com.example.MyAuthenticationProvider"
            },
            {
                "classname" : "jaas",
                "policyName" : "modeshape-jcr"
            }
        ]
    },
    "query" : {
        "enabled" : true,
        "textExtracting" : {
            "threadPool" : "test",
            "extractors" : {
                "customExtractor" : {
                    "name" : "MyFileType extractor",
                    "classname" : "com.example.myfile.MyExtractor"
                },
                "tikaExtractor" : {
                    "name" : "General content-based extractor",
                    "classname" : "tika"
                }
            }
        },
        "indexStorage" : {
            "type" : "filesystem",
            "location" : "Thorough/indexes",
            "lockingStrategy" : "native",
            "fileSystemAccessType" : "auto"
        },
        "indexing" : {
            "threadPool" : "modeshape-workers",
            "analyzer" : "org.apache.lucene.analysis.standard.StandardAnalyzer",
            "similarity" : "org.apache.lucene.search.DefaultSimilarity",
            "batchSize" : -1,
            "indexFormat" : "LUCENE_35",
            "readerStrategy" : "shared",
            "mode" : "sync",
            "rebuildOnStartup" : {
                "when" : "if_missing",
                "includeSystemContent" : false,
                "mode" : "sync"
            },
            "asyncThreadPoolSize" : 1,
            "asyncMaxQueueSize" : 0,
            "backend" : {
                "type" : "lucene"
            },
            "hibernate.search.custom.overridden.property" : "value"
        }
    },
    "sequencing" : {
        "removeDerivedContentWithOriginal" : true,
        "threadPool" : "modeshape-workers",
        "sequencers" : {
            "zipSequencer" : {
                "classname" : "ZipSequencer",
                "pathExpressions" : ["default:/files(//)(*.zip[*])/jcr:content[@jcr:data] => default:/sequenced/zip/$1"]
            },
            "delimitedTextSequencer" : {
                "classname" : "org.modeshape.sequencer.text.DelimitedTextSequencer",
                "pathExpressions" : [
                    "default:/files//(*.csv[*])/jcr:content[@jcr:data] => default:/sequenced/text/delimited/$1"
                ],
                "splitPattern" : ","
            }
        }
    },
    "clustering" : {
    }
}

See our documentation about the ModeShape JSON configuration file format for more information.

It is also possible to access the configuration of a running repository, change the configuration, and then update the running repository:

// Get the configuration ...
RepositoryConfiguration config = repository1.getConfiguration();

// Edit the configuration (which is a JSON document) to change a value ...
Editor editor = config.edit();
editor.getOrCreateDocument(FieldName.STORAGE)
      .getOrCreateDocument(FieldName.BINARY_STORAGE)
      .setNumber(FieldName.MINIMUM_BINARY_SIZE_IN_BYTES, newLargeValueSizeInBytes);
Changes changes = editor.getChanges();

// Apply the changes to the deployed repository ...
Future<Boolean> future = engine.update(config.getName(), changes);

// And optionally wait until the repository configuration is updated ...
future.get();

Many configuration changes can be applied to a repository while it is running, but not everything. For example, changing where data is stored will apply only after the repository is shut down and restarted.
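A plain future.get() blocks until the update finishes, however long that takes. Where a bounded wait is preferable, the standard java.util.concurrent API allows a timeout. The sketch below assumes only the JDK: BoundedWaitSketch and awaitUpdate are hypothetical names, and a CompletableFuture stands in for the future that an engine would return.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; it relies only on java.util.concurrent, not on ModeShape.
final class BoundedWaitSketch {

    // Returns true only if the future completed with TRUE within the timeout.
    static boolean awaitUpdate(Future<Boolean> future, long timeout, TimeUnit unit) {
        try {
            return Boolean.TRUE.equals(future.get(timeout, unit));
        } catch (TimeoutException e) {
            // The update may still complete later; the caller decides whether to retry.
            return false;
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new IllegalStateException("Update failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the future returned by an update call.
        Future<Boolean> finished = CompletableFuture.completedFuture(Boolean.TRUE);
        Future<Boolean> pending = new CompletableFuture<>();
        System.out.println(awaitUpdate(finished, 1, TimeUnit.SECONDS));
        System.out.println(awaitUpdate(pending, 10, TimeUnit.MILLISECONDS));
    }
}
```

Returning false on a timeout leaves the choice to the caller: wait again, or continue with the old configuration still in effect.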

Migrating content

ModeShape 3.0 provides an efficient backup and restore capability that works at the repository level. This means that each backup will contain all of the content in all of the workspaces of a single repository. Backups can be used to recover a repository to an earlier state (after a corruption, hardware failure, etc.), and they also serve as a mechanism for migrating ModeShape 2.x repositories to 3.x.

The ModeShape 2.8.4 distribution will contain a utility that, given a ModeShape 2.x configuration, accesses the configured connectors to process all of the persisted content and output it in a 3.0-compatible backup format. This utility should be run while the 2.x repository is no longer being used, and the resulting backup files can then be used to "restore" a 3.0 repository so that it has the same content as the 2.x repository.

There's also one huge advantage to this approach: if the 2.x backup fails, simply clean up the files on the file system and start the backup process again. This process also modifies none of the 2.x storage, so it's safe to do multiple times.

This process will copy the data from the 2.x storage system into the Infinispan cache. Only remove the old data after the new repository has been verified.

JBoss AS

One other major change is that ModeShape 3 can be installed into JBoss AS7, which is a very fast and lightweight application server. The integration is very good: ModeShape is a service (or subsystem) within AS7, and is configured and managed using the regular AS7 configuration files or tooling. Managing a ModeShape instance across a JBoss AS7 domain (cluster) is just as easy as with any other AS7 subsystem. Plus, ModeShape just uses AS7's built-in support for Infinispan, JGroups, security, and data sources, which means you configure these components using AS7's tools.

ModeShape 3 no longer provides integration with JBoss AS 5 and 6.

ModeShape 3 can of course be used with other application and web servers, including JBoss AS5 and 6. But just like with ModeShape 2, doing so basically just embeds ModeShape within your web application or service, and no other integration with the server is provided.